Chapter 3 - Applied Exercises

Author

Ricardo J. Serrano

# import libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# from sklearn.linear_model import LinearRegression
import scipy.stats as stats

# import data visualisation tools
import matplotlib.pyplot as plt
%matplotlib inline
# from matplotlib import pylab
# import plotly.plotly as py
# import plotly.graph_objs as go
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')

import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (10, 8)

3.8

This question involves the use of simple linear regression on the Auto data set.

Data dictionary

mpg - miles per gallon

cylinders - number of cylinders between 3 and 8

displacement - engine displacement (cu. inches)

horsepower - engine horsepower

weight - vehicle weight (lbs.)

acceleration - time to accelerate from 0 to 60 mph (sec.)

year - model year (modulo 100)

origin - vehicle origin (1. American, 2. European, 3. Japanese)

name - vehicle name

Exploratory Data Analysis (EDA)

url = "../Data/Auto.csv"
Auto = pd.read_csv(url)

First five rows head()

Auto.head()
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino

Auto dataset variable types

Auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           397 non-null    float64
 1   cylinders     397 non-null    int64  
 2   displacement  397 non-null    float64
 3   horsepower    397 non-null    int64  
 4   weight        397 non-null    int64  
 5   acceleration  397 non-null    float64
 6   year          397 non-null    int64  
 7   origin        397 non-null    int64  
 8   name          397 non-null    object 
dtypes: float64(3), int64(5), object(1)
memory usage: 28.0+ KB

Verify missing values

Auto.isnull().sum().sum()
0

No missing values!

Auto descriptive statistics

Auto.describe().T
count mean std min 25% 50% 75% max
mpg 397.0 23.515869 7.825804 9.0 17.5 23.0 29.0 46.6
cylinders 397.0 5.458438 1.701577 3.0 4.0 4.0 8.0 8.0
displacement 397.0 193.532746 104.379583 68.0 104.0 146.0 262.0 455.0
horsepower 397.0 104.196474 38.357393 46.0 75.0 93.0 125.0 230.0
weight 397.0 2970.261965 847.904119 1613.0 2223.0 2800.0 3609.0 5140.0
acceleration 397.0 15.555668 2.749995 8.0 13.8 15.5 17.1 24.8
year 397.0 75.994962 3.690005 70.0 73.0 76.0 79.0 82.0
origin 397.0 1.574307 0.802549 1.0 1.0 1.0 2.0 3.0

Auto histograms (numerical variables)

Auto.hist()
array([[<Axes: title={'center': 'mpg'}>,
        <Axes: title={'center': 'cylinders'}>,
        <Axes: title={'center': 'displacement'}>],
       [<Axes: title={'center': 'horsepower'}>,
        <Axes: title={'center': 'weight'}>,
        <Axes: title={'center': 'acceleration'}>],
       [<Axes: title={'center': 'year'}>,
        <Axes: title={'center': 'origin'}>, <Axes: >]], dtype=object)

  (a) Use the sm.OLS() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output.
y = Auto.mpg.astype(float)
x = Auto.horsepower.astype(float)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

Model summary

model.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.604
Model: OLS Adj. R-squared: 0.603
Method: Least Squares F-statistic: 603.4
Date: Sun, 10 Sep 2023 Prob (F-statistic): 1.50e-81
Time: 15:19:52 Log-Likelihood: -1195.5
No. Observations: 397 AIC: 2395.
Df Residuals: 395 BIC: 2403.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 40.0426 0.717 55.862 0.000 38.633 41.452
horsepower -0.1586 0.006 -24.565 0.000 -0.171 -0.146
Omnibus: 16.479 Durbin-Watson: 0.925
Prob(Omnibus): 0.000 Jarque-Bera (JB): 17.349
Skew: 0.494 Prob(JB): 0.000171
Kurtosis: 3.271 Cond. No. 322.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
  i. Is there a relationship between the predictor and the response?

Yes. The R-squared between mpg and horsepower is 0.604, which means the regression model explains about 60.4% of the variability in mpg (page 79).

Also, the F-statistic p-value is essentially 0, which indicates that the linear regression model is significant (the null hypothesis being an intercept-only model).

  ii. How strong is the relationship between the predictor and the response?

RSE (residual standard error)

model.resid.std(ddof=X.shape[1])
4.928554656293288

Since the mean of mpg is 23.5158, the percentage error is approximately 21% (RSE / mean(mpg)).

  iii. Is the relationship between the predictor and the response positive or negative?
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

print(f'Slope: {slope:.4f}')
print(f'Intercept (constant): {intercept:.4f}')
print(f'R-value (Pearson coefficient): {r_value:.4f}')
print(f'R-squared (coefficient of determination): {r_value**2:.4f}')
print(f'p-value: {p_value:.4f}')
Slope: -0.1586
Intercept (constant): 40.0426
R-value (Pearson coefficient): -0.7774
R-squared (coefficient of determination): 0.6044
p-value: 0.0000

Plot linear regression model

plotdata = pd.concat([x, y], axis = 1)
# lmplot is a figure-level function that creates its own figure,
# so size it via height and aspect instead of plt.figure()
sns.lmplot(x = "horsepower", y = "mpg", data = plotdata, height = 8, aspect = 2)
plt.show()

Negative. An increase in horsepower is associated with a decrease in mpg.

  iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
# confidence interval/prediction confidence interval (statsmodels)
new_data = np.array([1, 98])
pred = model.get_prediction(new_data)
pred.summary_frame(alpha = 0.05)
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 24.498698 0.250572 24.006076 24.991319 14.796705 34.200691

Confidence and prediction intervals: What’s the difference?

  (c) Produce some diagnostic plots of the least squares regression fit as described in the lab. Comment on any problems you see with the fit.

Source: Regression tutorial

Residuals histogram

sns.histplot(model.resid)
<Axes: ylabel='Count'>

Calculate the mean (mu) and standard deviation (std) for residuals distribution

mu, std = stats.norm.fit(model.resid)
print(mu, std)
-1.4175059111385374e-14 4.916124486044523

Shapiro-Wilk normality test

stats.shapiro(model.resid)
ShapiroResult(statistic=0.9817615151405334, pvalue=6.613603181904182e-05)

The p-value is well below 0.05, so the Shapiro-Wilk test rejects the null hypothesis that the residuals are normally distributed.

Residuals vs. Fitted Values

Source: From R to Python

from statsmodels.nonparametric.smoothers_lowess import lowess

# function to plot residuals vs. fitted values
def resid_fitted_plot(model):

    residuals = model.resid
    fitted = model.fittedvalues
    smoothed = lowess(residuals, fitted)
    top3 = abs(residuals).sort_values(ascending = False)[:3]

    plt.rcParams.update({'font.size': 16})
    fig, ax = plt.subplots()
    ax.scatter(fitted, residuals, edgecolors = 'k', facecolors = 'none')
    ax.plot(smoothed[:,0],smoothed[:,1],color = 'r')
    ax.set_ylabel('Residuals')
    ax.set_xlabel('Fitted Values')
    ax.set_title('Residuals vs. Fitted')
    ax.plot([min(fitted), max(fitted)],[0,0],color = 'k',linestyle = ':', alpha = .3)

    for i in top3.index:
        ax.annotate(i, xy=(fitted[i],residuals[i]))

    plt.show()

resid_fitted_plot(model = model)

The plot clearly shows a non-linear pattern in the residuals, suggesting the straight-line fit is inadequate.

Q-Q plot

sm.qqplot(model.resid, line='s')
plt.show()

Most of the residuals follow the reference line, but both tails deviate from it.

Standardized residuals vs. Fitted Values (homoskedasticity test)

# function to plot standardized residuals vs. fitted values
def std_resid_fitted_plot(model):

    student_residuals = model.get_influence().resid_studentized_internal
    fitted = model.fittedvalues
    sqrt_student_residuals = pd.Series(np.sqrt(np.abs(student_residuals)))
    sqrt_student_residuals.index = model.resid.index
    smoothed = lowess(sqrt_student_residuals, fitted)
    top3 = abs(sqrt_student_residuals).sort_values(ascending = False)[:3]

    fig, ax = plt.subplots()
    ax.scatter(fitted, sqrt_student_residuals, edgecolors = 'k', facecolors = 'none')
    ax.plot(smoothed[:,0],smoothed[:,1],color = 'r')
    ax.set_ylabel(r'$\sqrt{|Studentized \ Residuals|}$')  # raw string avoids invalid-escape warnings
    ax.set_xlabel('Fitted Values')
    ax.set_title('Scale-Location')
    ax.set_ylim(0,max(sqrt_student_residuals)+0.1)
    for i in top3.index:
        ax.annotate(i,xy=(fitted[i],sqrt_student_residuals[i]))
    plt.show()

std_resid_fitted_plot(model = model)

The lowess smoother upward trend is indicative of heteroskedasticity.

3.9

Continue using the Auto dataset.

  (a) Produce a scatterplot matrix that includes all of the variables in the data set.
sns.pairplot(Auto)  # pairplot is figure-level and creates its own figure
<seaborn.axisgrid.PairGrid at 0x29e7856d0>

With hue = origin

sns.pairplot(Auto, hue = 'origin', palette='Dark2')
<seaborn.axisgrid.PairGrid at 0x2a2308c70>

  (b) Compute the matrix of correlations between the variables using the DataFrame.corr() method.
Auto.corr(numeric_only = True)  # exclude the non-numeric name column
mpg cylinders displacement horsepower weight acceleration year origin
mpg 1.000000 -0.776260 -0.804443 -0.777416 -0.831739 0.422297 0.581469 0.563698
cylinders -0.776260 1.000000 0.950920 0.843075 0.897017 -0.504061 -0.346717 -0.564972
displacement -0.804443 0.950920 1.000000 0.896921 0.933104 -0.544162 -0.369804 -0.610664
horsepower -0.777416 0.843075 0.896921 1.000000 0.863422 -0.686353 -0.418909 -0.453047
weight -0.831739 0.897017 0.933104 0.863422 1.000000 -0.419502 -0.307900 -0.581265
acceleration 0.422297 -0.504061 -0.544162 -0.686353 -0.419502 1.000000 0.282901 0.210084
year 0.581469 -0.346717 -0.369804 -0.418909 -0.307900 0.282901 1.000000 0.184314
origin 0.563698 -0.564972 -0.610664 -0.453047 -0.581265 0.210084 0.184314 1.000000

Correlation heatmap

sns.heatmap(Auto.corr(numeric_only = True), vmin=-1, vmax=1, annot=True, cmap='BrBG')
<Axes: >

Pearson correlation p-values

# Source: https://www.statology.org/p-value-correlation-pandas/
#create function to calculate p-values for each pairwise correlation coefficient
from scipy.stats import pearsonr

Auto_num = Auto.drop('name', axis = 1)

def r_pvalues(df):
    cols = pd.DataFrame(columns=df.columns)
    p = cols.transpose().join(cols, how='outer')
    for r in df.columns:
        for c in df.columns:
            tmp = df[df[r].notnull() & df[c].notnull()]
            # .loc avoids the deprecated chained assignment p[r][c] = ...
            p.loc[c, r] = round(pearsonr(tmp[r], tmp[c])[1], 4)
    return p

#use custom function to calculate p-values
r_pvalues(Auto_num)
mpg cylinders displacement horsepower weight acceleration year origin
mpg 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
cylinders 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
displacement 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
horsepower 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
weight 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
acceleration 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
year 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0002
origin 0.0 0.0 0.0 0.0 0.0 0.0 0.0002 0.0
  (c) Use the sm.OLS() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
# X = Auto[['cylinders', 'displacement', 'horsepower', 'weight',
#        'acceleration', 'year', 'origin']]
# Y = Auto['mpg']
# X1 = sm.add_constant(X)
# reg = sm.OLS(Y, X1).fit()
# `cylinders` and `origin` are categorical variables
reg = ols('mpg ~ C(cylinders) + displacement + horsepower + weight + acceleration + year + C(origin)', data = Auto).fit()

Model summary

reg.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.847
Model: OLS Adj. R-squared: 0.843
Method: Least Squares F-statistic: 194.1
Date: Sun, 10 Sep 2023 Prob (F-statistic): 1.60e-149
Time: 15:20:07 Log-Likelihood: -1006.7
No. Observations: 397 AIC: 2037.
Df Residuals: 385 BIC: 2085.
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -22.7941 4.531 -5.030 0.000 -31.704 -13.885
C(cylinders)[T.4] 6.6322 1.656 4.005 0.000 3.377 9.888
C(cylinders)[T.5] 6.9389 2.518 2.756 0.006 1.988 11.890
C(cylinders)[T.6] 3.3943 1.826 1.859 0.064 -0.196 6.984
C(cylinders)[T.8] 5.2162 2.110 2.472 0.014 1.067 9.365
C(origin)[T.2] 1.9222 0.542 3.546 0.000 0.856 2.988
C(origin)[T.3] 2.6022 0.524 4.970 0.000 1.573 3.632
displacement 0.0190 0.007 2.641 0.009 0.005 0.033
horsepower -0.0316 0.013 -2.438 0.015 -0.057 -0.006
weight -0.0060 0.001 -9.672 0.000 -0.007 -0.005
acceleration 0.0503 0.091 0.551 0.582 -0.129 0.230
year 0.7451 0.049 15.338 0.000 0.650 0.841
Omnibus: 40.976 Durbin-Watson: 1.319
Prob(Omnibus): 0.000 Jarque-Bera (JB): 72.643
Skew: 0.627 Prob(JB): 1.68e-16
Kurtosis: 4.679 Cond. No. 9.34e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.34e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Comments:

  1. The model R-squared is 0.847, higher than that of the simple linear regression model studied in 3.8.
  2. All variables are statistically significant (i.e., p-values < 0.05), except acceleration.
  3. Some model variables exhibit strong multicollinearity (note the large condition number).
  i. Is there a relationship between the predictors and the response? Use the anova_lm() function from statsmodels to answer this question.
sm.stats.anova_lm(reg, typ = 2)
sum_sq df F PR(>F)
C(cylinders) 552.287462 4.0 14.345651 6.302814e-11
C(origin) 253.698889 2.0 13.179642 2.907966e-06
displacement 67.126803 1.0 6.974467 8.604694e-03
horsepower 57.206505 1.0 5.943749 1.522114e-02
weight 900.333829 1.0 93.544579 5.956696e-20
acceleration 2.920793 1.0 0.303470 5.820347e-01
year 2264.131446 1.0 235.242990 8.926148e-42
Residual 3705.490256 385.0 NaN NaN
  ii. Which predictors appear to have a statistically significant relationship to the response?

The ANOVA table confirms that every predictor has a statistically significant relationship with the response mpg, except acceleration.

  iii. What does the coefficient for the year variable suggest?

Since the year coefficient is positive (0.7451), mpg improves by roughly 0.75 mpg per model year, holding the other predictors fixed.

  (d) Produce some diagnostic plots of the linear regression fit as described in the lab. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

Residuals vs. Fitted Values

resid_fitted_plot(model = reg)

The plot shows a non-linear relationship between the residuals and fitted values.

Q-Q plot

sm.qqplot(reg.resid, line='s')
plt.show()

Most of the residuals follow the reference line, but both tails deviate from it.

Standardized residuals vs. Fitted Values (homoskedasticity test)

std_resid_fitted_plot(model = reg)

The lowess smoother upward trend is indicative of heteroskedasticity.

  (e) Fit some models with interactions as described in the lab. Do any interactions appear to be statistically significant?

Let’s try interaction between cylinders and displacement

reg_1 = ols('mpg ~ C(cylinders) + displacement + horsepower + weight + acceleration + year + C(origin) + C(cylinders) * displacement', data = Auto).fit()

Model summary (reg_1)

reg_1.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.868
Model: OLS Adj. R-squared: 0.863
Method: Least Squares F-statistic: 167.6
Date: Sun, 10 Sep 2023 Prob (F-statistic): 2.80e-157
Time: 15:20:08 Log-Likelihood: -977.12
No. Observations: 397 AIC: 1986.
Df Residuals: 381 BIC: 2050.
Df Model: 15
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -39.0460 24.487 -1.595 0.112 -87.193 9.101
C(cylinders)[T.4] 30.0936 24.330 1.237 0.217 -17.745 77.932
C(cylinders)[T.5] 28.2114 25.984 1.086 0.278 -22.879 79.302
C(cylinders)[T.6] 19.3454 24.471 0.791 0.430 -28.770 67.460
C(cylinders)[T.8] 9.1698 24.453 0.375 0.708 -38.910 57.250
C(origin)[T.2] 0.3880 0.544 0.713 0.476 -0.682 1.458
C(origin)[T.3] 0.9712 0.533 1.822 0.069 -0.077 2.019
displacement 0.2082 0.335 0.621 0.535 -0.451 0.868
C(cylinders)[T.4]:displacement -0.2810 0.335 -0.839 0.402 -0.940 0.378
C(cylinders)[T.5]:displacement -0.2496 0.341 -0.732 0.464 -0.920 0.420
C(cylinders)[T.6]:displacement -0.2118 0.335 -0.631 0.528 -0.871 0.448
C(cylinders)[T.8]:displacement -0.1735 0.335 -0.518 0.605 -0.833 0.486
horsepower -0.0443 0.013 -3.392 0.001 -0.070 -0.019
weight -0.0040 0.001 -6.231 0.000 -0.005 -0.003
acceleration -0.0684 0.087 -0.784 0.433 -0.240 0.103
year 0.7731 0.046 16.682 0.000 0.682 0.864
Omnibus: 52.889 Durbin-Watson: 1.332
Prob(Omnibus): 0.000 Jarque-Bera (JB): 109.572
Skew: 0.725 Prob(JB): 1.61e-24
Kurtosis: 5.126 Cond. No. 1.16e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

The interaction between cylinders and displacement does not appear to be statistically significant (all interaction p-values > 0.05).

Let’s try interaction between weight and displacement

reg_2 = ols('mpg ~ C(cylinders) + displacement + horsepower + weight + acceleration + year + C(origin) + weight * displacement', data = Auto).fit()

Model summary (reg_2)

reg_2.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.870
Model: OLS Adj. R-squared: 0.866
Method: Least Squares F-statistic: 214.0
Date: Sun, 10 Sep 2023 Prob (F-statistic): 1.03e-161
Time: 15:20:08 Log-Likelihood: -974.79
No. Observations: 397 AIC: 1976.
Df Residuals: 384 BIC: 2027.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -11.8994 4.393 -2.708 0.007 -20.538 -3.261
C(cylinders)[T.4] 6.8805 1.530 4.497 0.000 3.872 9.889
C(cylinders)[T.5] 9.2080 2.343 3.930 0.000 4.601 13.815
C(cylinders)[T.6] 6.3181 1.724 3.664 0.000 2.928 9.709
C(cylinders)[T.8] 6.8359 1.960 3.488 0.001 2.983 10.689
C(origin)[T.2] 1.2236 0.508 2.408 0.017 0.225 2.223
C(origin)[T.3] 1.3543 0.507 2.670 0.008 0.357 2.352
displacement -0.0618 0.012 -5.188 0.000 -0.085 -0.038
horsepower -0.0328 0.012 -2.736 0.007 -0.056 -0.009
weight -0.0101 0.001 -13.264 0.000 -0.012 -0.009
acceleration 0.0226 0.084 0.267 0.790 -0.144 0.189
year 0.7842 0.045 17.373 0.000 0.695 0.873
weight:displacement 2.103e-05 2.57e-06 8.183 0.000 1.6e-05 2.61e-05
Omnibus: 50.785 Durbin-Watson: 1.334
Prob(Omnibus): 0.000 Jarque-Bera (JB): 106.570
Skew: 0.694 Prob(JB): 7.22e-24
Kurtosis: 5.125 Cond. No. 2.61e+07


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.61e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

The interaction between weight and displacement is statistically significant, and improves the R-squared metric.
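A note on the formula syntax used above: in patsy formulas, a * b expands to a + b + a:b, while a:b adds only the product term. A small sketch with synthetic data (variable names here are illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(4)
df = pd.DataFrame({'a': rng.normal(size=50), 'b': rng.normal(size=50)})
df['y'] = df.a + df.b + df.a * df.b + rng.normal(0, 0.1, 50)

star = ols('y ~ a * b', data=df).fit()         # main effects + interaction
colon = ols('y ~ a + b + a:b', data=df).fit()  # the same model written out
print(list(star.params.index))  # ['Intercept', 'a', 'b', 'a:b']
```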

  (f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

Log-transform horsepower and square-root-transform acceleration:

reg_3 = ols('mpg ~ C(cylinders) + displacement + horsepower + weight + acceleration + year + C(origin) + np.log(horsepower) + np.sqrt(acceleration)', data = Auto).fit()

Model summary (reg_3)

reg_3.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.870
Model: OLS Adj. R-squared: 0.865
Method: Least Squares F-statistic: 196.7
Date: Sun, 10 Sep 2023 Prob (F-statistic): 1.93e-160
Time: 15:20:08 Log-Likelihood: -975.02
No. Observations: 397 AIC: 1978.
Df Residuals: 383 BIC: 2034.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 86.2208 17.279 4.990 0.000 52.248 120.194
C(cylinders)[T.4] 7.0212 1.551 4.526 0.000 3.971 10.072
C(cylinders)[T.5] 7.7135 2.341 3.296 0.001 3.112 12.315
C(cylinders)[T.6] 5.5607 1.730 3.215 0.001 2.160 8.961
C(cylinders)[T.8] 6.8930 1.987 3.469 0.001 2.986 10.800
C(origin)[T.2] 0.7924 0.524 1.512 0.131 -0.238 1.823
C(origin)[T.3] 1.8944 0.493 3.843 0.000 0.925 2.864
displacement -0.0068 0.008 -0.897 0.370 -0.022 0.008
horsepower 0.1201 0.026 4.565 0.000 0.068 0.172
weight -0.0034 0.001 -5.018 0.000 -0.005 -0.002
acceleration 1.5061 0.982 1.533 0.126 -0.426 3.438
year 0.7265 0.045 16.133 0.000 0.638 0.815
np.log(horsepower) -20.3411 2.887 -7.045 0.000 -26.018 -14.664
np.sqrt(acceleration) -14.3650 7.951 -1.807 0.072 -29.998 1.269
Omnibus: 35.287 Durbin-Watson: 1.564
Prob(Omnibus): 0.000 Jarque-Bera (JB): 67.330
Skew: 0.524 Prob(JB): 2.40e-15
Kurtosis: 4.724 Cond. No. 3.95e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.95e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

log(horsepower) is statistically significant, while sqrt(acceleration) remains not statistically significant (p = 0.072).
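The X² transformation from the prompt can be written inside a formula with I(); a sketch on synthetic data with genuine curvature (not the Auto set):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic horsepower-like predictor with a quadratic effect on the response
rng = np.random.default_rng(5)
df = pd.DataFrame({'hp': rng.uniform(40, 230, 200)})
df['mpg'] = 55 - 0.4 * df.hp + 0.001 * df.hp ** 2 + rng.normal(0, 2, 200)

quad = ols('mpg ~ hp + I(hp ** 2)', data=df).fit()
print(quad.pvalues['I(hp ** 2)'])  # the squared term should be significant here
```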

3.10

This question should be answered using the Carseats data set.

Data dictionary

A data frame with 400 observations on the following 11 variables.

Sales - Unit sales (in thousands) at each location

CompPrice - Price charged by competitor at each location

Income - Community income level (in thousands of dollars)

Advertising - Local advertising budget for company at each location (in thousands of dollars)

Population - Population size in region (in thousands)

Price - Price company charges for car seats at each site

ShelveLoc - A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

Age - Average age of the local population

Education - Education level at each location

Urban - A factor with levels No and Yes to indicate whether the store is in an urban or rural location

US - A factor with levels No and Yes to indicate whether the store is in the US or not

Exploratory Data Analysis (EDA)

file = "../Data/Carseats.csv"
CarSeats = pd.read_csv(file)

First five rows head()

CarSeats.head()
SlNo Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
0 1 9.50 138 73 11 276 120 Bad 42 17 Yes Yes
1 2 11.22 111 48 16 260 83 Good 65 10 Yes Yes
2 3 10.06 113 35 10 269 80 Medium 59 12 Yes Yes
3 4 7.40 117 100 4 466 97 Medium 55 14 Yes Yes
4 5 4.15 141 64 3 340 128 Bad 38 13 Yes No

CarSeats dataset variable types

CarSeats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   SlNo         400 non-null    int64  
 1   Sales        400 non-null    float64
 2   CompPrice    400 non-null    int64  
 3   Income       400 non-null    int64  
 4   Advertising  400 non-null    int64  
 5   Population   400 non-null    int64  
 6   Price        400 non-null    int64  
 7   ShelveLoc    400 non-null    object 
 8   Age          400 non-null    int64  
 9   Education    400 non-null    int64  
 10  Urban        400 non-null    object 
 11  US           400 non-null    object 
dtypes: float64(1), int64(8), object(3)
memory usage: 37.6+ KB

Verify missing values

CarSeats.isnull().sum().sum()
0

No missing values!

CarSeats descriptive statistics

CarSeats.describe().T
count mean std min 25% 50% 75% max
SlNo 400.0 200.500000 115.614301 1.0 100.75 200.50 300.25 400.00
Sales 400.0 7.496325 2.824115 0.0 5.39 7.49 9.32 16.27
CompPrice 400.0 124.975000 15.334512 77.0 115.00 125.00 135.00 175.00
Income 400.0 68.657500 27.986037 21.0 42.75 69.00 91.00 120.00
Advertising 400.0 6.635000 6.650364 0.0 0.00 5.00 12.00 29.00
Population 400.0 264.840000 147.376436 10.0 139.00 272.00 398.50 509.00
Price 400.0 115.795000 23.676664 24.0 100.00 117.00 131.00 191.00
Age 400.0 53.322500 16.200297 25.0 39.75 54.50 66.00 80.00
Education 400.0 13.900000 2.620528 10.0 12.00 14.00 16.00 18.00

CarSeats histograms (numerical variables)

CarSeats.hist()
array([[<Axes: title={'center': 'SlNo'}>,
        <Axes: title={'center': 'Sales'}>,
        <Axes: title={'center': 'CompPrice'}>],
       [<Axes: title={'center': 'Income'}>,
        <Axes: title={'center': 'Advertising'}>,
        <Axes: title={'center': 'Population'}>],
       [<Axes: title={'center': 'Price'}>,
        <Axes: title={'center': 'Age'}>,
        <Axes: title={'center': 'Education'}>]], dtype=object)

  (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
reg = ols(formula = 'Sales ~ Price + C(Urban) + C(US)', data = CarSeats).fit() # C prepares categorical data for regression

Model summary

reg.summary()
OLS Regression Results
Dep. Variable: Sales R-squared: 0.239
Model: OLS Adj. R-squared: 0.234
Method: Least Squares F-statistic: 41.52
Date: Sun, 10 Sep 2023 Prob (F-statistic): 2.39e-23
Time: 15:20:09 Log-Likelihood: -927.66
No. Observations: 400 AIC: 1863.
Df Residuals: 396 BIC: 1879.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 13.0435 0.651 20.036 0.000 11.764 14.323
C(Urban)[T.Yes] -0.0219 0.272 -0.081 0.936 -0.556 0.512
C(US)[T.Yes] 1.2006 0.259 4.635 0.000 0.691 1.710
Price -0.0545 0.005 -10.389 0.000 -0.065 -0.044
Omnibus: 0.676 Durbin-Watson: 1.912
Prob(Omnibus): 0.713 Jarque-Bera (JB): 0.758
Skew: 0.093 Prob(JB): 0.684
Kurtosis: 2.897 Cond. No. 628.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
  (b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

For a one-unit increase in Price, ceteris paribus, Sales decrease by 0.0545 units (54.5 seats, since Sales are in thousands). A store in an urban location is associated with Sales 0.0219 units lower than a non-urban store, ceteris paribus, although this effect is not statistically significant. A store located in the US is associated with Sales 1.2006 units higher than a non-US store, ceteris paribus.

  (c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\(\hat{y} = 13.0435 + (-0.0219 \times Urban) + (1.2006 \times US) + (-0.0545 \times Price)\)

Where Urban and US are encoded as dummy variables:

  • Urban: Yes => 1
  • Urban: No => 0
  • US: Yes => 1
  • US: No => 0
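As a sanity check on the equation, plugging in hypothetical values (a US, urban store charging 100) reproduces the prediction by hand; the coefficients are those reported in the summary above.

```python
# Coefficients from the model summary above
intercept, b_urban, b_us, b_price = 13.0435, -0.0219, 1.2006, -0.0545

# Hypothetical store: urban (1), US (1), Price = 100
urban, us, price = 1, 1, 100
sales = intercept + b_urban * urban + b_us * us + b_price * price
print(round(sales, 4))  # predicted Sales, in thousands of units
```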
  (d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?

We can reject the null hypothesis for Price and US (p-values ≈ 0). For Urban we cannot reject it, given its high p-value (0.936).

  (e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
reg_1 = ols(formula = 'Sales ~ Price + C(US)', data = CarSeats).fit()

Model summary (reg_1)

reg_1.summary()
OLS Regression Results
Dep. Variable: Sales R-squared: 0.239
Model: OLS Adj. R-squared: 0.235
Method: Least Squares F-statistic: 62.43
Date: Sun, 10 Sep 2023 Prob (F-statistic): 2.66e-24
Time: 15:20:09 Log-Likelihood: -927.66
No. Observations: 400 AIC: 1861.
Df Residuals: 397 BIC: 1873.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 13.0308 0.631 20.652 0.000 11.790 14.271
C(US)[T.Yes] 1.1996 0.258 4.641 0.000 0.692 1.708
Price -0.0545 0.005 -10.416 0.000 -0.065 -0.044
Omnibus: 0.666 Durbin-Watson: 1.912
Prob(Omnibus): 0.717 Jarque-Bera (JB): 0.749
Skew: 0.092 Prob(JB): 0.688
Kurtosis: 2.895 Cond. No. 607.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
  (f) How well do the models in (a) and (e) fit the data?

Judging by the R-squared and adjusted R-squared values, both models explain a similar percentage of the variation in the response (Sales). However, model (e) achieves the same fit with fewer predictors, making it the more parsimonious model.

  (g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

From the model summary table:

  • Intercept: (11.790, 14.271)
  • US: (0.692, 1.708)
  • Price: (-0.065, -0.044)

  (h) Is there evidence of outliers or high leverage observations in the model from (e)?

Residuals vs. Fitted Values

resid_fitted_plot(model = reg_1)

Potential outliers: observations 50, 68 and 376.

Residuals vs. Leverage plot

# influence_plot creates its own figure unless an Axes is supplied
fig, ax = plt.subplots(figsize = (25, 15))
sm.graphics.influence_plot(reg_1, criterion = "cooks", ax = ax)
ax.set_title("Residuals vs. Leverage")
plt.show()

Observation 42 is a high leverage point.